Claim Rejections - 35 USC § 103 

The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

Claims 1-3 and 13-15 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Basu et al. (2003/0018475) in view of Burnett et al. 
(2004/0133421) and further in view of Girod (6,483,532). 

As to claim 1, Basu teaches, in an "audio-visual speech detection and recognition 
system", a noise reduction system including an audio-visual user interface for combining 
visual features extracted from a digital video sequence with audio features extracted 
from an analog audio sequence including background noise, the system comprising: 

speech sequence detection means for detecting audio signals (Par.0012-0013); 

speech feature extraction and analyzing means (Par.0038, 0042); 

video sequence detection means for detecting said video sequence (Par.0010); 

visual feature extraction and analysis means for analyzing the detected video 
sequence and extracting said visual features therefrom (Par.0081); and 

a means to prevent background noise from being processed by the system 
based on the derived speech characteristics and to output a speech activity indication 
signal based on the combination of the speech detection and video sequence detection 
means (Par.0094-0097; abstract; Figs.1, 8-10; Par.0088). 

It is noted that Basu doesn't explicitly teach removing noise from the speech signal. 

Burnett, however, teaches a noise suppression system (Fig. 2), where the noise 
is removed from the speech signal according to a result of a voice activity detector, 
wherein the voice activity detector includes motion sensors to detect the motion of the 
speaker (Figs.2, 6-11; Pars.0070-0076, 0082, 0089-0092). 



Burnett and Basu are analogous in that both are drawn to detecting speech using 
information additional to the acoustic signal, such as image and motion, for the purpose 
of handling background noise. Their combination, and the removal of noise from the 
speech signal in Basu's teaching, is therefore not unexpected and would have been 
obvious for the purpose of processing and removing the noise that is mixed with the 
speech signal of the intended speaker. 

It is also noted that Basu doesn't explicitly teach where the system comprises 
echo cancellation means. 

Girod, however, teaches, in a video-assisted audio signal processing system, a 
noise reduction system (Fig.3) for modifying a speech signal, including an audio-visual 
user interface for combining visual features extracted from a digital video sequence with 
an audio sequence, said system comprising: 

audio signal processing means (324) for processing audio signal; 

video sequence detection means (354) for detecting said video sequence; 

visual feature extraction and analysis means (360) for analyzing the detected 
video sequence and extracting said visual features therefrom; and 

a multi-channel acoustic echo cancellation unit (312) configured to perform a 
near-end speaker detection (314) and double-talk detection (318) algorithm based on 
the audio analysis means and the visual detection means and to modify the near-end 
speech by cancelling echo (noise) in the speech signal (abstract; Figs. 1-4; Col.1, line 
50-Col.2, line 38). 

It would have been obvious to one of ordinary skill in the art at the time of 
applicant's invention to modify Basu's teaching as claimed, in view of Girod, for the 
purpose of reliably distinguishing speech that is meant to be processed by the system 
from unintended background speech, including acoustic echo, thereby avoiding false 
activation of the system. 

As to claim 2, Basu teaches enabling/disabling the microphone based on 
whether or not the speech energy level detected is below/above a 'given signal level' 
(threshold) (Par.0097, 0094, 0096). 

As to claim 3, Basu teaches where the audio feature extraction and analysis 
means comprises an amplitude detector (Par.0039). 

As to claims 13-15, Basu teaches the corresponding system for reducing noise in 
speech using audio features plus visual speech feature vectors, as addressed above for 
claim 1 in detail. Burnett teaches removing noise from the speech signal using motion 
detectors and acoustic detectors in a telephone device, and Girod teaches where the 
system disclosed is used in a video communication/telephony application including a 
microphone, video camera and speaker (Figs. 1-3; Claim 8). The motivation for 
using the Basu system in a video-telephony application would have been obvious to one 
skilled in the art for the purpose of reliably detecting background noise in the 
communication signal. 

Claims 4, 7 and 8 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Basu et al. (2003/0018475) in view of Burnett et al. (2004/0133421) and further 
in view of Wynn (5,706,394). 

As to claim 4, Basu teaches a method for reducing noise comprising the steps of: 

converting analog speech to digital; 

an acoustic feature extraction process by Fourier transforming the magnitudes of 
discrete samples of speech data (Par.0038-0039, 0042); 

detecting speech in an audio signal by analyzing visual features extracted from a 
video sequence associated with the audio sequence, including current position of face, 
lip or facial expression of the speaker; and 

preventing background noise from being processed by the system based on the 
derived speech characteristics and outputting a speech activity indication signal based on 
the combination of the audio processing and video sequence detection means 
(Par.0094-0097; abstract; Figs.1, 8-10; Par.0088). 

Basu doesn't explicitly teach removing noise from the speech signal. 

Burnett, however, teaches a noise suppression system, where the noise is 
removed from the speech signal according to a result of a voice activity detector, wherein 
the voice activity detector includes motion sensors to detect the motion of the speaker 
(Figs.2, 6-11; Pars.0070-0076, 0082, 0089-0092). 

Burnett and Basu are analogous in that both are drawn to detecting speech using 
information additional to the acoustic signal, such as image and motion, for the purpose 
of handling background noise. Their combination, and the removal of noise from the 
speech signal in Basu's teaching, would therefore have been obvious for the purpose of 
processing and removing the noise that is mixed with the speech signal spoken by the 
intended speaker. 

Burnett doesn't explicitly teach the claimed process for removing noise. 

Wynn teaches a method for reducing noise in speech, comprising: 

estimating a noise power density spectrum of background noise based on a 
voice activity detector, inherently the voice representing the user's voice; 

subtracting the estimated noise power from the speech signal; and 

inverse transforming the signal into the time domain, where the noise-subtracted 
speech signal could be input to a speech recognizer (abstract; Col.1, lines 31-35; Col. 8, 
line 65-Col.9, line 11; Col. 16, lines 14-20). It would have been obvious to one of ordinary 
skill in the art at the time of applicant's invention to modify Basu's system in view of Wynn 
for the purpose of efficiently removing background noise from the speech signal. 
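
For illustration only, the three steps attributed to Wynn above (noise spectrum estimation gated by a voice activity detector, subtraction, inverse transform) can be sketched as a generic power spectral subtraction. This is not taken from Wynn, Basu or Burnett; the function names, the frame representation, the running-average noise estimate and the zero floor are all assumptions made for the sketch.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a list of real samples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; returns the real part of each time-domain sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def spectral_subtract(frames, is_speech):
    """Hypothetical helper: frames are equal-length sample lists,
    is_speech holds one voice-activity flag per frame."""
    n = len(frames[0])
    noise_psd = [0.0] * n   # running estimate of the noise power spectrum
    noise_frames = 0
    cleaned_frames = []
    for frame, speech in zip(frames, is_speech):
        spec = dft(frame)
        power = [abs(c) ** 2 for c in spec]
        if not speech:
            # Step 1: update the noise estimate only when no voice is detected.
            noise_frames += 1
            noise_psd = [old + (p - old) / noise_frames
                         for old, p in zip(noise_psd, power)]
        # Step 2: subtract the noise power per bin, floor at zero, keep the phase.
        cleaned = [cmath.rect(math.sqrt(max(p - nz, 0.0)), cmath.phase(c))
                   for c, p, nz in zip(spec, power, noise_psd)]
        # Step 3: inverse transform back to the time domain.
        cleaned_frames.append(idft(cleaned))
    return cleaned_frames
```

In this sketch the voice activity flags are supplied by the caller, so they could come from an acoustic detector, a motion sensor, or a visual detector; the subtraction itself is indifferent to the source of the decision.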

As to claim 7, Basu and Burnett teach wherein said visual speech 
characteristics are based on detecting the face, the opening of a mouth of the speaker, 
detecting the lips of the speaker or detecting other phonetic characteristics 
associated with position and movement of the lips (Par.0043-0046, Figs.2-4; 
Figs.3-10). 

As to claim 8, Basu teaches detecting the voice of the speaker by analyzing 
visual features extracted from video sequences associated with the speech, where the 
visual features include mouth movement, face, the lips of the speaker or detecting other 
phonetic characteristics associated with position and movement of the lips 
(Par.0043-0046, Figs.2-4). 

Claims 5, 6, 9 and 10-12 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Basu et al. (2003/0018475) in view of Burnett et al. 
(2004/0133421), Wynn (5,706,394) and Girod (6,483,532). 

As to claim 5, Basu teaches where acoustic-phonetic characteristics (visual speech 
features) are derived by an algorithm for extracting the visual features from a video 
sequence associated with an audio sequence, including movement and position of the lips 
or facial expression in an image signal (Par.0081). Burnett teaches removing noise from 
the speech signal using motion sensors to detect the speaker's movement. The step of 
acoustic echo cancellation as claimed is not taught by Basu; however, Girod, as 
addressed above for claim 1, teaches a near-end acoustic echo signal detection and 
cancelling process utilizing the combination of video detection means and audio 
processing means. The motivation for combining the two teachings is the same as provided 
for claim 1. 

As to claim 6, Girod teaches where the acoustic echo cancellation process 
includes a double talk detection procedure (Fig.3). 

As to claim 9, Wynn teaches where the noise suppressing method comprises 
comparing the spectrum of, inherently delayed, audio input with a voice activity estimate 
(threshold, TH) obtained by amplitude detection of a filtered discrete signal spectrum to 
provide an estimate for a frequency spectrum corresponding to a signal which 
represents a voice of said speaker as well as an estimate for the noise power density 
spectrum of the statistically distributed background noise (Fig. 13; Col. 14, lines 14-30; 
Col. 15, lines 2-20). 

As to claims 10 and 12, Basu teaches a speech presence estimation means and 
an event detection means, where the event detection means comprises the audio 
feature vectors, A, extracted from the audio signal and visual speech feature vectors, V, 
extracted from visual sequences and which are representative of visual speech, and the 
detection is made on the combination of the two sets of feature vectors, i.e., the audio 
plus the visual-speech features (Par.0042, 0080). 

As to claim 12, Basu teaches where speech activity estimate features and visual- 
speech activity estimate features are combined/added to form a single audio visual- 
speech feature vector and correlated to audio visual-speech probabilities to make the 
detection decision (Par.103-104, 107). 

Basu also teaches detecting speech using an energy threshold as discussed above. 
Basu, however, doesn't explicitly teach where the speech/noise estimate is updated as 
claimed. Wynn teaches where the speech activity threshold is updated for every frame 
according to spectrally estimated noise in the speech signal (Fig. 13), and this process 
would have been obvious in Basu's system for the purpose of adjusting the energy 
threshold in accordance with the level of the present background noise as well as for 
effectively cancelling background noise in the speech signal. 
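
For illustration only, a per-frame threshold update of the general kind described above can be sketched as follows. This is a generic adaptive energy threshold, not the procedure of Wynn's Fig. 13; the function name, the `margin` and `alpha` parameters, and the exponential smoothing rule are assumptions made for the sketch.

```python
def detect_speech(frame_energies, margin=2.0, alpha=0.9, initial_noise=1.0):
    """Hypothetical helper: flag a frame as speech when its energy exceeds
    a threshold that tracks the estimated background noise level."""
    noise = initial_noise
    decisions = []
    for energy in frame_energies:
        # The threshold is recomputed for every frame from the noise estimate.
        threshold = margin * noise
        speech = energy > threshold
        decisions.append(speech)
        if not speech:
            # Only non-speech frames update the background noise estimate.
            noise = alpha * noise + (1.0 - alpha) * energy
    return decisions
```

With a rising noise floor the threshold rises with it, so a frame energy that would exceed a fixed threshold at the start of the signal may no longer be flagged once the noise estimate has adapted.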

As to claim 11, adjusting the frequency band of the filtered signal is inherent in 
Wynn's teaching (Col. 15, lines 2-20). 

Response to Arguments 

Applicant's arguments with respect to the claims have been considered but are 
moot in view of the new ground(s) of rejection. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Daniel D. Abebe whose telephone number is 
571-272-7615. The examiner can normally be reached on Monday-Friday. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, David Hudspeth, can be reached on 571-272-7843. The fax phone number 
for the organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 



/Daniel D Abebe/ 

Primary Examiner, Art Unit 2626 



